Conversation
Pull request overview
This PR enables FP16 (half-precision floating-point) support for the GELU (Gaussian Error Linear Unit) activation operator in ONNX Runtime opset 20. The implementation provides optimized compute paths using ARM SVE (Scalable Vector Extension) and NEON intrinsics for both tanh and erf approximation methods, with fallback to scalar FP32 computation when vector intrinsics are not available.
Key changes:
- Adds FP16 kernel registration for GELU operator alongside the existing FP32 implementation
- Implements optimized FP16 ERF and TANH kernels using ARM SVE and NEON intrinsics
- Adds comprehensive test coverage for both tanh and erf approximation modes with FP16 inputs
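The scalar FP32 fallback mentioned above can be sketched as below. This is a simplified illustration, not the MLAS kernel itself: the real code operates on `MLFloat16` buffers and widens/narrows via hardware conversions, and the function names here are hypothetical, while plain `float` arrays stand in for both ends.

```cpp
#include <cmath>
#include <cstddef>

// Exact (erf-based) GELU: 0.5 * x * (1 + erf(x / sqrt(2))).
// In the fallback path each FP16 element is widened to FP32, this math is
// applied, and the result is narrowed back to FP16.
inline float GeluErfScalar(float x) {
    return 0.5f * x * (1.0f + std::erf(x * 0.70710678f));
}

// Illustrative fallback loop (the name is hypothetical, not an MLAS symbol).
void GeluErfScalarKernel(const float* input, float* output, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        output[i] = GeluErfScalar(input[i]);
    }
}
```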
Reviewed changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 27 comments.
| File | Description |
|---|---|
| onnxruntime/core/providers/cpu/cpu_execution_provider.cc | Registers typed GELU kernels for float and MLFloat16 types |
| onnxruntime/core/providers/cpu/tensor/gelu.cc | Implements FP16 GELU computation with SVE/NEON optimizations and scalar fallback |
| onnxruntime/core/providers/cpu/math/element_wise_ops.cc | Adds FP16 ERF operator support using new SVE/NEON kernels |
| onnxruntime/test/providers/cpu/activation/activation_op_test.cc | Adds FP16 GELU tests for both tanh and erf approximations |
| onnxruntime/core/mlas/lib/tanh.cpp | Adds SVE path for FP16 tanh computation |
| onnxruntime/core/mlas/lib/sve/mlasi_sve.h | Declares SVE FP16 function signatures |
| onnxruntime/core/mlas/lib/sve/mlas_sve_fp16.h | Adds SVE FP16 intrinsic wrapper functions |
| onnxruntime/core/mlas/lib/sve/Elementwise_sve_fp16.cpp | Implements SVE FP16 tanh, erf, and GELU kernels |
| onnxruntime/core/mlas/lib/fp16_common.h | Adds NEON FP16 helper functions for erf computation |
| onnxruntime/core/mlas/lib/erf.cpp | Implements NEON FP16 erf kernel |
| onnxruntime/core/mlas/inc/mlas.h | Exports NEON FP16 erf kernel function |
| cmake/onnxruntime_providers_cpu.cmake | Adds ARM FP16 compile flags for gelu.cc and includes MLAS headers |
| cmake/onnxruntime_mlas.cmake | Adds SVE FP16 elementwise source and compile flags for erf.cpp |
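For the scalar fallback path, FP16 values must first be widened to FP32. One well-known way to sketch that conversion is shown below; this is a hypothetical helper, not the routine in `fp16_common.h`, and Inf/NaN inputs are deliberately not handled.

```cpp
#include <cstdint>
#include <cstring>

// Widen IEEE binary16 bits to binary32. The half exponent/mantissa bits are
// shifted into float positions, then the exponent-bias mismatch (15 vs 127)
// is corrected by scaling with 2^112. The same scaling also normalizes half
// subnormals. Inf/NaN are left out of this sketch.
float HalfBitsToFloat(uint16_t h) {
    uint32_t sign = static_cast<uint32_t>(h & 0x8000u) << 16;
    uint32_t expMant = static_cast<uint32_t>(h & 0x7FFFu) << 13;
    uint32_t bits = sign | expMant;
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f * 0x1p112f;  // 2^112 corrects the bias difference (127 - 15)
}
```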
/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

Azure Pipelines successfully started running 4 pipeline(s).
ca56982 to cc2625d
/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

Azure Pipelines successfully started running 4 pipeline(s).
Pull request overview
Copilot reviewed 17 out of 17 changed files in this pull request and generated 19 comments.
Comments suppressed due to low confidence (1)
onnxruntime/core/providers/cpu/math/element_wise_ops.cc:2034
- The allocator is retrieved but no longer used after switching to the native FP16 ERF implementation. The lines getting the temp space allocator (lines 2032-2034) should be removed as they are now unnecessary.
// get allocator for temporary buffers
AllocatorPtr alloc;
ORT_RETURN_IF_ERROR(context->GetTempSpaceAllocator(&alloc));
Separate platform-dependent code
3ccdca0 to cf6d83f
@hariharans29 we have pushed the code to resolve all the above comments

Please manually "resolve" Copilot's comments, and add a comment if you think Copilot's suggestion is not applicable and you're not taking it. Thanks.
/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

Azure Pipelines successfully started running 4 pipeline(s).
Pull request overview
Copilot reviewed 17 out of 17 changed files in this pull request and generated 2 comments.
Hi @hariharans29,

There are still some unaddressed Copilot comments. Please manually resolve them with a comment on whether you are going with Copilot's recommendation or not.
Hi @hariharans29,

As stated above, please resolve all Copilot comments and my old comments. Resolving comments is a gating check for merging a PR. I'll start another round of CI.
Thanks, will resolve them. |
Hi @hariharans29, we pushed the latest commit with the CI fixes. Could you please trigger the CI pipeline?

/azp run Linux QNN CI Pipeline,Win_TRT_Minimal_CUDA_Test_CI,Windows ARM64 QNN CI Pipeline,Windows GPU Doc Gen CI Pipeline

Azure Pipelines successfully started running 4 pipeline(s).
Thanks for taking a look at most of Copilot's comments and manually resolving them. There are a few more remaining; please take a look. I'll take another look at this PR soon.
@@ -0,0 +1,182 @@
/*++
Copyright 2025 FUJITSU LIMITED
Can you also add the Microsoft copyright along with this in the new files, please?
*/
void
MLASCALL
MlasGemmBatchPackUseKleidi(bool enable);
Don't think this exists anymore. Merge issue?
Enabled FP16 Gelu for opset 20. Gelu uses tanh and erf functions depending on the approximation method used. Implemented tanh in SVE, and erf in both SVE and NEON.
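For reference, the two approximation modes reduce to the following FP32 math. This is a sketch only; the kernels in this PR evaluate the same formulas on FP16 vectors using polynomial approximations of tanh/erf, and these function names are illustrative.

```cpp
#include <cmath>

// "tanh" approximation: 0.5 * x * (1 + tanh(sqrt(2/pi) * (x + 0.044715 x^3)))
float GeluTanhApprox(float x) {
    const float kAlpha = 0.7978845608f;  // sqrt(2 / pi)
    return 0.5f * x * (1.0f + std::tanh(kAlpha * (x + 0.044715f * x * x * x)));
}

// "none" (exact) mode: 0.5 * x * (1 + erf(x / sqrt(2)))
float GeluExact(float x) {
    return 0.5f * x * (1.0f + std::erf(x * 0.70710678f));
}
```

The two modes agree closely over typical activation ranges, which is why the tanh form is commonly used as a fast path.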
Gr3E results: with tanh and erf approximation:
Gr4 results: with tanh and erf approximation:
This PR is a joint contribution by:
Aruna K (@akote123)
Abhishek Jain (@abhijain1204fujitsu)
Sanket Kale (@sanketkaleoss)